Home Projects Agentic Browser API Server Website Processing API

Website Processing API

Referenced Files

api/main.py routers/website.py routers/website_validator.py models/requests/website.py models/response/website.py services/website_service.py services/website_validator_service.py prompts/website.py prompts/prompt_injection_validator.py tools/website_context/request_md.py tools/website_context/html_md.py core/config.py

Introduction#

This document describes the Website Processing API, focusing on:

Website content extraction via server-side fetching and client-provided HTML
HTML-to-Markdown conversion
Website content validation against prompt injection risks
Request/response schemas and validation requirements
End-to-end workflows for scraping, content analysis, and validation
Practical client integration patterns and limitations

Endpoints:

POST /api/genai/website/ — Process website content and answer questions
POST /api/validator/validate-website — Validate website HTML for safety

Project Structure#

The Website Processing API is implemented as a FastAPI application with modular routers, services, models, prompts, and tools.

graph TB subgraph "API Layer" APP["FastAPI App
api/main.py"] WEBSITE_RT["Router: website
routers/website.py"] VALIDATOR_RT["Router: website_validator
routers/website_validator.py"] end subgraph "Services" WS["WebsiteService
services/website_service.py"] WVS["WebsiteValidatorService
services/website_validator_service.py"] end subgraph "Models" REQ["WebsiteRequest
models/requests/website.py"] RES["WebsiteResponse
models/response/website.py"] end subgraph "Prompts" PROMPT["Prompt Chain
prompts/website.py"] PIV["Validator Prompt Template
prompts/prompt_injection_validator.py"] end subgraph "Tools" RM["Markdown Fetcher (Jina)
tools/website_context/request_md.py"] HM["HTML → Markdown
tools/website_context/html_md.py"] end APP --> WEBSITE_RT --> WS APP --> VALIDATOR_RT --> WVS WS --> RM WS --> HM WS --> PROMPT WVS --> HM WVS --> PIV REQ --> WEBSITE_RT RES --> WEBSITE_RT

Diagram sources

Section sources

api/main.py

Core Components#

Website Processing Endpoint
- Method: POST
- Path: /api/genai/website/
- Purpose: Accept a URL and a question, optionally include client HTML and chat history, and return a synthesized answer using both server-fetched and client-rendered contexts.
Website Validation Endpoint
- Method: POST
- Path: /api/validator/validate-website
- Purpose: Validate HTML content for prompt injection risks by converting to Markdown and evaluating with a language model.

Section sources

Architecture Overview#

End-to-end flow for website processing and validation:

sequenceDiagram participant C as "Client" participant API as "FastAPI App" participant R as "website Router" participant S as "WebsiteService" participant RM as "Markdown Fetcher (Jina)" participant HM as "HTML → Markdown" participant LLM as "LLM Chain" C->>API : "POST /api/genai/website/" API->>R : "Dispatch request" R->>S : "generate_answer(url, question, chat_history, client_html)" S->>RM : "Fetch server markdown for URL" RM-->>S : "Markdown content" alt "client_html provided" S->>HM : "Convert client HTML to Markdown" HM-->>S : "Client Markdown" else "no client_html" S->>S : "Use empty client context" end S->>LLM : "Prompt with server/client context + question + chat history" LLM-->>S : "Answer" S-->>R : "Answer" R-->>C : "{ answer }"

Diagram sources

Detailed Component Analysis#

Website Processing Endpoint#

URL: POST /api/genai/website/
Request Schema (WebsiteRequest)
- url: string (required)
- question: string (required)
- chat_history: array of objects (optional; default: empty)
- client_html: string (optional; if provided, converted to Markdown)
- attached_file_path: string (optional; if provided, uses Google AI SDK to process)
Response Schema (WebsiteResponse)
- answer: string
Processing Logic
- Server-side markdown fetch via Jina AI
- Optional client HTML to Markdown conversion
- Chat history aggregation into a string
- Optional file upload and generation via Google AI SDK
- Prompt composition and LLM invocation
- Answer returned as plain text
Validation Requirements
- url and question are required; otherwise returns 400
- Errors are logged and surfaced as 500 with details
Rate Limiting and Content Filtering
- No explicit rate limiting in code
- Jina AI service may apply limits; consider retries/backoff in clients
- Content filtering is implicit via prompt instructions and validator endpoint

flowchart TD Start(["Request Received"]) --> Validate["Validate url and question"] Validate --> Valid{"Valid?"} Valid --> |No| Err400["Return 400 Bad Request"] Valid --> |Yes| Fetch["Fetch server markdown via Jina"] Fetch --> HasClient{"client_html provided?"} HasClient --> |Yes| Convert["Convert client HTML to Markdown"] HasClient --> |No| SkipConvert["Skip conversion"] Convert --> Merge["Merge contexts"] SkipConvert --> Merge Merge --> Attached{"attached_file_path provided?"} Attached --> |Yes| Upload["Upload file via Google AI SDK"] Upload --> Gen["Generate content with LLM"] Attached --> |No| Gen Gen --> Done(["Return answer"]) Err400 --> Done

Diagram sources

Section sources

Website Validation Endpoint#

URL: POST /api/validator/validate-website
Request Schema (WebsiteValidatorRequest)
- html: string (required)
Response Schema (WebsiteValidatorResponse)
- is_safe: boolean (default: false)
Processing Logic
- Convert HTML to Markdown
- Build a validation prompt with the Markdown content
- Invoke LLM to classify as safe or unsafe
- Return boolean flag indicating safety
Validation Requirements
- html is required; ensure proper HTML payload
Security Notes
- Designed to detect prompt injection attempts by analyzing Markdown representation of HTML

flowchart TD VStart(["Validation Request"]) --> ToMd["Convert HTML to Markdown"] ToMd --> BuildPrompt["Build validation prompt"] BuildPrompt --> InvokeLLM["Invoke LLM classification"] InvokeLLM --> Classify{"Result == 'true'?"} Classify --> |Yes| Safe["is_safe = true"] Classify --> |No| Unsafe["is_safe = false"] Safe --> VDone(["Return response"]) Unsafe --> VDone

Diagram sources

Section sources

Supporting Tools and Prompts#

Server-side Markdown Fetcher (Jina AI)
- Converts a URL into clean Markdown via an external service
- Returns raw text/markdown or an error message string
HTML to Markdown Converter
- Parses HTML and converts to Markdown for downstream processing
Prompt Chains
- Website prompt composes server and client contexts with question and chat history
- Validator prompt checks for prompt injection indicators

Section sources

Dependency Analysis#

API Registration
- Routers mounted under specific prefixes:
  - /api/genai/website (website router)
  - /api/validator (website validator router)
Service Dependencies
- WebsiteService depends on:
  - Markdown fetcher (Jina)
  - HTML-to-Markdown converter
  - Prompt chain (LangChain)
- WebsiteValidatorService depends on:
  - HTML-to-Markdown converter
  - Validator prompt template
  - LLM client
External Integrations
- Jina AI service for server-side markdown fetching
- Optional Google AI SDK for file processing when attached_file_path is provided

graph LR API["api/main.py"] --> RT1["routers/website.py"] API --> RT2["routers/website_validator.py"] RT1 --> SVC1["services/website_service.py"] RT2 --> SVC2["services/website_validator_service.py"] SVC1 --> JINA["tools/website_context/request_md.py"] SVC1 --> HM["tools/website_context/html_md.py"] SVC1 --> PROMPT["prompts/website.py"] SVC2 --> HM SVC2 --> VPROMPT["prompts/prompt_injection_validator.py"]

Diagram sources

Section sources

api/main.py

Performance Considerations#

Latency Factors
- Network latency to Jina AI service for server-side markdown fetching
- Optional Google AI SDK file upload and generation
- LLM inference time for prompt evaluation
Recommendations
- Cache server markdown for repeated queries to the same URL
- Compress or truncate very large client_html payloads
- Implement client-side retry/backoff for transient failures from external services
- Consider batching multiple requests when feasible

[No sources needed since this section provides general guidance]

Troubleshooting Guide#

Common HTTP Errors
- 400 Bad Request: Missing url or question in request
- 500 Internal Server Error: Unhandled exceptions during processing
Error Handling Behavior
- Website endpoint logs errors and returns a structured 500 response
- Validation endpoint returns a deterministic boolean; ensure input HTML is well-formed
Environment and Configuration
- Ensure environment variables for logging and optional Google API keys are set appropriately
Client-Side Tips
- Validate request payloads before sending
- Handle network timeouts and retry logic for external services
- Normalize chat_history entries to dictionaries with role/content fields

Section sources

Conclusion#

The Website Processing API provides a robust pipeline for extracting, converting, and analyzing website content, while offering a dedicated validation endpoint to mitigate prompt injection risks. By combining server-side and client-side contexts, it delivers comprehensive answers grounded in both static and dynamic page content. Clients should implement appropriate retries, payload normalization, and error handling to integrate smoothly with the API.

[No sources needed since this section summarizes without analyzing specific files]

Appendices#

API Reference#

Website Processing
- Method: POST
- URL: /api/genai/website/
- Request Body Fields
  - url: string (required)
  - question: string (required)
  - chat_history: array of objects (optional)
  - client_html: string (optional)
  - attached_file_path: string (optional)
- Response Body Fields
  - answer: string
Website Validation
- Method: POST
- URL: /api/validator/validate-website
- Request Body Fields
  - html: string (required)
- Response Body Fields
  - is_safe: boolean

Section sources

Example Workflows#

Website Scraping and Analysis
- Steps
  - Send POST to /api/genai/website/ with url and question
  - Optionally include client_html to capture client-rendered content
  - Optionally include chat_history for conversational context
  - Receive answer synthesized from both server and client contexts
- Notes
  - If attached_file_path is provided, the service uploads the file and generates content using the Google AI SDK
Content Validation
- Steps
  - Send POST to /api/validator/validate-website with html payload
  - Receive is_safe boolean indicating whether the content is considered safe

Section sources

Client Implementation Patterns#

Basic Client Call Pattern
- Construct request payload with url and question
- Set Content-Type to application/json
- Handle non-OK responses and parse JSON on success
Integration Tips
- For dynamic pages, capture client HTML in the browser and pass client_html
- For multi-turn conversations, accumulate chat_history entries
- For sensitive documents, consider uploading files via the attached_file_path path when supported

Section sources

Previous Search Integration API

Next You Tube Integration API

Agentic Browser

AI Agent System

API Server

Browser Automation

Browser Extension

Data Models And Schemas

Prompts And Prompt Engineering

Service Integrations

System Architecture

Tool System

Website Processing API

Table of Contents#

Introduction#

Project Structure#

Core Components#

Architecture Overview#

Detailed Component Analysis#

Website Processing Endpoint#

Website Validation Endpoint#

Supporting Tools and Prompts#

Dependency Analysis#

Performance Considerations#

Troubleshooting Guide#

Conclusion#

Appendices#

API Reference#

Example Workflows#

Client Implementation Patterns#